De-anonymizing Programmers via Code Stylometry

Authors

  • Aylin Caliskan
  • Richard E. Harang
  • Andrew Liu
  • Arvind Narayanan
  • Clare R. Voss
  • Fabian Yamaguchi
  • Rachel Greenstadt
Abstract

Source code authorship attribution is a significant privacy threat to anonymous code contributors. However, it may also enable attribution of successful attacks from code left behind on an infected system, or aid in resolving copyright, copyleft, and plagiarism issues in the programming fields. In this work, we investigate machine learning methods to de-anonymize source code authors of C/C++ using coding style. Our Code Stylometry Feature Set is a novel representation of coding style found in source code that reflects coding style from properties derived from abstract syntax trees. Our random forest and abstract syntax tree-based approach attributes more authors (1,600 and 250) with significantly higher accuracy (94% and 98%) on a larger data set (Google Code Jam) than has been previously achieved. Furthermore, these novel features are robust, difficult to obfuscate, and can be used in other programming languages, such as Python. We also find that (i) the code resulting from difficult programming tasks is easier to attribute than easier tasks and (ii) skilled programmers (who can complete the more difficult tasks) are easier to attribute than less skilled programmers.
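The abstract describes features derived from abstract syntax trees and notes that they can carry over to Python. As a rough illustration only, the sketch below uses Python's standard `ast` module to compute two simple syntactic measurements for a source string: the relative frequency of each AST node type and the maximum tree depth. The function name `ast_style_features` and the two sample snippets are assumptions made for this example; this is not the paper's Code Stylometry Feature Set, and the random forest classifier is omitted.

```python
import ast
from collections import Counter


def ast_style_features(source: str) -> dict:
    """Return simple syntactic features for one Python source string.

    Hypothetical helper for illustration; not the paper's feature set.
    """
    tree = ast.parse(source)

    # Count how often each AST node type occurs in the tree.
    counts = Counter(type(node).__name__ for node in ast.walk(tree))
    total = sum(counts.values())

    def depth(node: ast.AST) -> int:
        # Depth of the subtree rooted at `node` (a leaf has depth 1).
        children = list(ast.iter_child_nodes(node))
        return 1 + max((depth(child) for child in children), default=0)

    # Normalize counts so that programs of different length are comparable.
    features = {f"freq_{name}": n / total for name, n in counts.items()}
    features["max_depth"] = depth(tree)
    return features


# Two ways of writing the same computation give different feature vectors.
sample_a = "total = 0\nfor i in range(10):\n    total += i\n"
sample_b = "total = sum(i for i in range(10))\n"

features_a = ast_style_features(sample_a)
features_b = ast_style_features(sample_b)
```

Vectors like `features_a` and `features_b` could then be passed to any standard classifier. The two samples differ in which node types appear (an explicit `For` loop versus a `GeneratorExp`), which is the kind of structural habit that syntax-based features can pick up.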

Similar resources

When Coding Style Survives Compilation: De-anonymizing Programmers from Executable Binaries

The ability to identify authors of computer programs based on their coding style is a direct threat to the privacy and anonymity of programmers. Previous work has examined attribution of authors from both source code and compiled binaries, and found that while source code can be attributed with very high accuracy, the attribution of executable binary appears to be much more difficult. Many pote...

Recognizing and Imitating Programmer Style: Adversaries in Program Authorship Attribution

Source code attribution classifiers have recently become powerful. We consider the possibility that an adversary could craft code with the intention of causing a misclassification, i.e., creating a forgery of another author’s programming style in order to hide the forger’s own identity or blame the other author. We find that it is possible for a non-expert adversary to defeat such a system. In ...

Git Blame Who?: Stylistic Authorship Attribution of Small, Incomplete Source Code Fragments

Program authorship attribution has implications for the privacy of programmers who wish to contribute code anonymously. While previous work has shown that complete files that are individually authored can be attributed, we show here for the first time that accounts belonging to open source contributors containing short, incomplete, and typically uncompilable fragments can also be effectively at...

De-anonymizing Authors of Electronic Texts: A Survey on Electronic Text Stylometry

Electronic text stylometry is a collection of forensics methods that analyze the writing styles of input electronic texts in order to extract information about authors of the input electronic texts. Such extracted information could be the identity of the authors, or aspects of the authors, such as their gender, age group, ethnicity, etc. This survey paper presents the following contributi...

Can Pseudonymity Really Guarantee Privacy?

One of the core challenges facing the Internet today is the problem of ensuring privacy for its users. It is believed that mechanisms such as anonymity and pseudonymity are essential building blocks in formulating solutions to address these challenges and considerable effort has been devoted towards realizing these primitives in practice. The focus of this effort, however, has mostly been on hidi...

Publication date: 2015